Improve retrying downloads code and testing#1559
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1559 +/- ##
==========================================
+ Coverage 88.34% 88.65% +0.30%
==========================================
Files 78 78
Lines 10735 10829 +94
==========================================
+ Hits 9484 9600 +116
+ Misses 1251 1229 -22
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
candleindark
left a comment
There was a problem hiding this comment.
At this point, _check_if_more_attempts_allowed is equally, if not more, about sleeping/waiting before sending out another requests. I think it would be beneficial to rename it, or in some other way, to indicate this nature.
If we want to improve the retrying further, I think we can consider incorporating some well-known and recommended algorithms for retrying requests to a server while avoiding overwhelming it. https://chatgpt.com/share/676f8290-86cc-800a-9aa8-1ad952282d27.
dandi/download.py
Outdated
| ) | ||
| lgr.debug( | ||
| "%s - download failed due to response %d with " | ||
| "Retry-After=%d: %s, will sleep and retry", |
There was a problem hiding this comment.
the "Retry-After=[]" in this log is now insistent and sleep_amount may not be a meaningful number if retry_after failed to be parsed into a datetime object
There was a problem hiding this comment.
sorry, I do not get what [] you have in mind since it can't be a list... please rereview when I push changes -- may be the issue is addressed.
There was a problem hiding this comment.
I think I meant that you have numerous logs with inconsistent labelling with "Retry-After=". Some of them you set it to the header value of the "Retry-After" (understandable), others you set it to sleep_amount which can originate from, but not exactly, the passed in header value for "Retry-After" and can default to -1.0.
There was a problem hiding this comment.
please check if it is all consistent now -- seems to me, always seems to point to retry_after.
There was a problem hiding this comment.
The Retry-After=[]s in log messages, where [] is used to indicate an expression of %d or %s, are consistent within the get_retry_after(), but this last one is still inconsistent with the ones in get_retry_after(). This last one prints the int value of sleep_amount, the result of get_retry_after(), and the others print whatever the value of the HTTP header Retry-After which can be a string of a specific date.
I think we can just remove the use of Retry-After= here. There is plenty of logged messages within get_retry_after()
There was a problem hiding this comment.
I actually think this log statement should be removed entirely. Because of the containing if-statement, this log statement doesn't always get executed for all responses that have a Retry-After header. However, at the same time, I think the original intent is to execute this log statement for all responses with a Retry-After header.
We are better off if we just replace the entire if statement:
if (sleep_amount := get_retry_after(exc.response)) is not None:
lgr.debug(
"%s - download failed due to response %d with "
"Retry-After=%d: %s, will sleep and retry",
path,
exc.response.status_code,
sleep_amount,
exc,
)with
sleep_amount = get_retry_after(exc.response)and unindent the next log statement as I recommended below.
There was a problem hiding this comment.
thanks! done that, pushed
datetime.UTC was introduced only in 3.11, so type checking which uses 3.9 ATM fails rightfully, as unittesting on 3.9 and 3.10
…ETRY_CODES are used It also add additional checks/protection against odd retry-after results (too long negatives or positives), which should address some concerns raised in prior code review
* origin/master: Update CHANGELOG.md [skip ci] Constrain version pin Further refine version pinning Pin `dandischema` to require the latest version Start Zarr download as soon as first page of entries is obtained Update CHANGELOG.md [skip ci] Use Python 3.10 to build docs Support dandischema v0.11.0 Fix typo in `dandi move` docstring Don't use version 0.25.5 of `responses` Update URL for DANDI Docs
…rm how long to sleep
Co-authored-by: John T. Wodder II <jwodder@users.noreply.github.com>
Co-authored-by: John T. Wodder II <jwodder@users.noreply.github.com>
Co-authored-by: John T. Wodder II <jwodder@users.noreply.github.com>
|
rerunning windows which puked with |
dandi/utils.py
Outdated
| elif sleep_amount < 0: | ||
| sleep_amount = None | ||
| lgr.warning( | ||
| "date in Retry-After=%r is in the past (current is %r). " |
There was a problem hiding this comment.
You may also want to do "Retry-After=%s" instead of "Retry-After=%r" if you want the HTTP header appears exactly the same as they are received, without the quotes.
dandi/download.py
Outdated
| ) | ||
| lgr.debug( | ||
| "%s - download failed due to response %d with " | ||
| "Retry-After=%d: %s, will sleep and retry", |
There was a problem hiding this comment.
The Retry-After=[]s in log messages, where [] is used to indicate an expression of %d or %s, are consistent within the get_retry_after(), but this last one is still inconsistent with the ones in get_retry_after(). This last one prints the int value of sleep_amount, the result of get_retry_after(), and the others print whatever the value of the HTTP header Retry-After which can be a string of a specific date.
I think we can just remove the use of Retry-After= here. There is plenty of logged messages within get_retry_after()
dandi/download.py
Outdated
| lgr.debug( | ||
| "%s - download failed on attempt #%d: %s, will sleep a bit and retry", | ||
| "%s - download failed on attempt #%d: %s, will sleep %f and retry", | ||
| path, | ||
| attempt, | ||
| exc, | ||
| sleep_amount, |
There was a problem hiding this comment.
I think just unindent these lines, and get the log statement out of the if-statement and remove the last log statement at line 1126. See the comments for the last log statement for more details.
There was a problem hiding this comment.
ok, makes sense, dedenting the log so we always log how long we are to sleep, not only if random amount due to no Retry-After. I did remove that other log
dandi/download.py
Outdated
| ) | ||
| lgr.debug( | ||
| "%s - download failed due to response %d with " | ||
| "Retry-After=%d: %s, will sleep and retry", |
There was a problem hiding this comment.
I actually think this log statement should be removed entirely. Because of the containing if-statement, this log statement doesn't always get executed for all responses that have a Retry-After header. However, at the same time, I think the original intent is to execute this log statement for all responses with a Retry-After header.
We are better off if we just replace the entire if statement:
if (sleep_amount := get_retry_after(exc.response)) is not None:
lgr.debug(
"%s - download failed due to response %d with "
"Retry-After=%d: %s, will sleep and retry",
path,
exc.response.status_code,
sleep_amount,
exc,
)with
sleep_amount = get_retry_after(exc.response)and unindent the next log statement as I recommended below.
… Retry-After= etc Co-authored-by: Isaac To <candleindark@users.noreply.github.com> Was amended locally later due to a dirty suggestion applied
|
🚀 PR was released in |
Follow up to
Continue retrying downloads on retriable statuses #1558
retry on 429 as well
handle Retry-After as a Date not just seconds (https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Date)
adjust logging of sleeping upon retry - do in a single place
We are into holidays already, so it is understood if review does not come. I will reserve the "right" to merge if there is user demand but likely we could just wait. If anyone sees "clearly" how to improve -- feel welcome to push directly.